1 Abstract
The muHVT package is a collection of R functions for building topology preserving maps for rich multivariate data. As datasets tend towards big data, with a large number of rows, such maps make them easier to compress, visualize and score. A collection of R functions for this typical workflow is organized below :
Data Compression: Vector quantization (VQ), HVQ (hierarchical vector quantization) using means or medians. This step compresses the rows (long data frame) using a compression objective
Data Projection: Dimension projection of the compressed cells to 1D, 2D and 3D with Sammon's non-linear projection algorithm. This step creates topology preserving map coordinates in the desired output dimension. Additionally, an embedding can be learned from the compressed data to further enhance the representation of the multivariate data. Embeddings provide a lower-dimensional space into which high-dimensional vectors, such as word vectors, can be translated, making it easier to perform machine learning tasks on large inputs.
Tessellation: Create the cells required for visualization using the Voronoi tessellation method; the package includes heatmap plots for hierarchical Voronoi tessellations (HVT). This step enables data insights, visualization, and interaction with the topology preserving map. Useful for semi-supervised tasks.
Prediction: Scoring new data sets and recording their assignment using the map objects from the above steps, in a sequence of maps if required
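Taken together, the four steps above amount to a short pipeline. A minimal sketch, assuming a numeric dataframe df and the muHVT functions demonstrated later in this vignette:

```r
library(muHVT)

# 1. Compress: hierarchical vector quantization of the rows of df
map <- HVT(df,
           n_cells = 100, depth = 1, quant.err = 0.46,
           distance_metric = "L1_Norm", error_metric = "max",
           quant_method = "kmeans", normalize = TRUE)

# 2-3. Project and tessellate: draw the topology preserving map
plotHVT(map, line.width = c(0.6), color.vec = c("#141B41"), maxDepth = 1)

# 4. Predict: score new rows against a set of trained maps
# new_scores <- predictLayerHVT(new_df, map_A, map_B, map_C)
```

The sections below walk through each of these steps in detail.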
2 Data Understanding
In this vignette, we will use the
Prices of Personal Computers dataset. This
dataset contains 6261 observations and 6 features. The dataset observes
the price from 1993 to 1995 of 486 personal computers in the US. The
variables are price, speed, hd, ram, screen and ads.
Here, we load the training and the testing dataset.
set.seed(240)
# Load data from csv files
trainComputers <- read.csv("https://raw.githubusercontent.com/Mu-Sigma/muHVT/dev/vignettes/sample_dataset/trainComputers.csv")
testComputers <- read.csv("https://raw.githubusercontent.com/Mu-Sigma/muHVT/dev/vignettes/sample_dataset/testComputers.csv")
Let’s have a look at the scaled training dataset containing 5008 data points.
# Quick peek
trainComputers <- scale(trainComputers)
metric_list <- colnames(trainComputers)
scale_attr <- attributes(trainComputers)
trainComputers <- trainComputers %>% as.data.frame()
DT::datatable(trainComputers, rownames = FALSE,
class = 'cell-border stripe',
options = list(scrollX = TRUE))
Now let us check the structure of the training data and analyse its summary.
str(trainComputers)
#> 'data.frame': 5008 obs. of 6 variables:
#> $ price : num -1.297 -0.8 -1.136 -0.709 1.721 ...
#> $ speed : num -1.195 -0.783 -1.195 -1.195 -0.783 ...
#> $ hd : num -1.3136 -1.2898 -0.8854 -0.8854 -0.0767 ...
#> $ ram : num -0.718 -1.108 -0.718 0.063 1.624 ...
#> $ screen: num -0.615 -0.615 0.549 -0.615 -0.615 ...
#> $ ads : num -2.39 -2.39 -2.39 -2.39 -2.39 ...
summary(trainComputers)
#> price speed hd ram
#> Min. :-2.2216 Min. :-1.1954 Min. :-1.31361 Min. :-1.10781
#> 1st Qu.:-0.7520 1st Qu.:-0.7834 1st Qu.:-0.68562 1st Qu.:-0.71755
#> Median :-0.1276 Median : 0.0920 Median :-0.07666 Median : 0.06296
#> Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
#> 3rd Qu.: 0.6269 3rd Qu.: 0.9159 3rd Qu.: 0.44666 3rd Qu.: 0.06296
#> Max. : 5.2569 Max. : 2.6667 Max. : 8.29651 Max. : 4.74606
#> screen ads
#> Min. :-0.615 Min. :-2.3880
#> 1st Qu.:-0.615 1st Qu.:-0.4416
#> Median :-0.615 Median : 0.2444
#> Mean : 0.000 Mean : 0.0000
#> 3rd Qu.: 0.549 3rd Qu.: 0.6273
#> Max. : 2.877 Max. : 1.5207
3 Map A : Compress using vector quantization
This package can perform vector quantization using the following algorithms -
- Hierarchical Vector Quantization using k−means
- Hierarchical Vector Quantization using k−medoids
For more information on vector quantization, refer to the references at the end of this vignette.
The HVT function constructs highly compressed hierarchical Voronoi tessellations. The raw data is first scaled, and this scaled data is supplied as input to the vector quantization algorithm. The algorithm compresses the dataset until a user-defined compression percentage/rate is achieved, using the quantization error as a threshold that determines the compression percentage. This means that for a given user-defined compression percentage we obtain ‘n’ cells, each of which has a quantization error below the threshold quantization error.
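As an illustration of this threshold logic (a sketch, not the package internals), the quantization error of a single cell under the L1_Norm distance and the max error metric could be computed as:

```r
# Illustrative only: quantization error of one cell.
# `points` is an m x n matrix of the cell's members; `centroid` is its codeword.
cell_quant_error <- function(points, centroid, error_metric = "max") {
  d <- apply(points, 1, function(p) sum(abs(p - centroid)))  # L1 distances
  if (error_metric == "max") max(d) else mean(d)
}

# A cell is split into further cells only while its quantization error
# stays above the quant.err threshold.
```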
Let’s try to comprehend the HVT function first before moving ahead.
HVT(
dataset,
min_compression_perc,
n_cells,
depth,
quant.err,
distance_metric = c("L1_Norm", "L2_Norm"),
error_metric = c("mean", "max"),
quant_method = c("kmeans", "kmedoids"),
normalize = TRUE,
diagnose = FALSE,
hvt_validation = FALSE,
train_validation_split_ratio = 0.8
)
Each of the parameters has been explained below :
- dataset - A dataframe with numeric columns
- min_compression_perc - An integer indicating the minimum percent compression rate to be achieved for the dataset
- n_cells - An integer indicating the number of cells per hierarchy (level)
- depth - An integer indicating the number of levels (1 = no hierarchy, 2 = 2 levels, etc.)
- quant.err - A number indicating the quantization error threshold. A cell will only break down into further cells if its quantization error is above the defined threshold
- distance_metric - The distance metric can be L1_Norm or L2_Norm. L1_Norm is selected by default. The distance metric is used to calculate the distance between an n-dimensional point and a centroid. The user can also pass a custom function to calculate this distance
- error_metric - The error metric can be mean or max. max is selected by default. max returns the maximum of the m values and mean returns their mean, where each value is the distance between a point and the centroid of its cell. The user can also pass a custom function to calculate the error metric
- quant_method - The quantization method can be kmeans or kmedoids. kmeans is selected by default
- normalize - A logical value indicating whether the columns in your dataset need to be normalized. Default value is TRUE. The algorithm supports Z-score normalization
- diagnose - A logical value indicating whether the user wants to perform diagnostics on the model. Default value is FALSE
- hvt_validation - A logical value indicating whether the user wants to hold out a validation set and find the mean absolute deviation of the validation points from the centroid. Default value is FALSE
- train_validation_split_ratio - A numeric value indicating the train-validation split ratio. This argument is only used when hvt_validation is set to TRUE. Default value is 0.8
First, we will construct map A by
performing Hierarchical Vector Quantization by setting the below
mentioned model parameters.
Model Parameters
- Number of Cells at each Level = 100
- Maximum Depth = 1
- Quantization Error Threshold = 0.46
- Error Metric = Max
- Distance Metric = Manhattan
map_A <- list()
map_A <- muHVT::HVT(trainComputers,
n_cells = 100,
quant.err = 0.46,
depth = 1,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans",
normalize = F
)
As per the manual, map_A[[3]] gives us
detailed information about the hierarchical vector quantized data.
map_A[[3]][['summary']] gives a nice tabular view containing the number of points, the quantization error and the codebook for each cell.
The datatable displayed below is the summary from map A
DT::datatable(map_A[[3]]$summary, rownames = FALSE,
class = 'cell-border stripe',
options = list(scrollX = TRUE))
Now let us understand what each column in the above table means:
- Segment.Level - Level of the cell. In this case, we have performed vector quantization for depth 1, hence the segment level is 1
- Segment.Parent - Parent segment of the cell
- Segment.Child (Cell.Number) - The children of a particular cell. In this case, it is the total number of cells at which we achieved the defined compression percentage
- n - Number of points in each cell
- Cell.ID - Cell IDs are generated for the multivariate data using the 1-D Sammon’s projection algorithm
- Quant.Error - Quantization error for each cell
All the columns after this will contain centroids for each cell. They can also be called a codebook, which represents a collection of all centroids or codewords.
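For instance, the codebook can be pulled out of the summary table by dropping the bookkeeping columns; a sketch, with column names taken from the table above:

```r
# Sketch: extract the codebook (one centroid per row) from the map A summary
summ <- map_A[[3]]$summary
bookkeeping <- c("Segment.Level", "Segment.Parent", "Segment.Child",
                 "n", "Cell.ID", "Quant.Error")
codebook <- summ[, setdiff(colnames(summ), bookkeeping)]
```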
Now, let’s check the compression summary for HVT (map A). The table below shows, for each level, the number of cells, the number of cells with quantization error below the threshold, and the percentage of cells with quantization error below the threshold.
mapA_compression_summary <- map_A[[3]]$compression_summary %>% dplyr::mutate_if(is.numeric, ~round(., 4))
DT::datatable(mapA_compression_summary, class = 'cell-border stripe', options = list(scrollX = TRUE))
As can be seen from the table above, 82% of the cells have hit the quantization error threshold.
Let’s plot the Voronoi tessellation for layer 1 (map A)
muHVT::plotHVT(map_A,
line.width = c(0.6),
color.vec = c("#141B41"),
centroid.size = 1.5,
maxDepth = 1)
Let’s have a look at the function hvtHmap that we will
use to overlay features as heatmap.
hvtHmap(hvt.results, dataset, child.level, hmap.cols, color.vec, line.width, palette.color = 6)
- hvt.results - A list of results obtained from the HVT function (map_A)
- dataset - A dataframe containing the variables to overlay as a heatmap. The user can pass an external dataset or the dataset that was used to perform hierarchical vector quantization. The dataset should have the same number of points as the dataset used to perform hierarchical vector quantization in the HVT function
- child.level - A number indicating the level for which the heat map is to be plotted
- hmap.cols - The column number or column name from the dataset indicating the variable for which the heat map is to be plotted. To plot the quantization error as a heatmap, pass 'quant_error'. Similarly, to plot the number of points in each cell as a heatmap, pass 'no_of_points'
- color.vec - A color vector such that length(color.vec) = child.level (default = NULL)
- line.width - A line width vector such that length(line.width) = child.level (default = NULL)
- palette.color - A number indicating the heat map color palette: 1 - rainbow, 2 - heat.colors, 3 - terrain.colors, 4 - topo.colors, 5 - cm.colors, 6 - BlCyGrYlRd (Blue, Cyan, Green, Yellow, Red) (default = 6)
- show.points - A boolean indicating whether the centroids should be plotted on the tessellations (default = FALSE)
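For example, going by the parameters above, the quantization error of each level-1 cell could be overlaid as a heatmap like so (a sketch along the lines of the calls used in this vignette):

```r
# Overlay the per-cell quantization error of map A as a heatmap
hvtHmap(map_A,
        trainComputers,
        child.level = 1,
        hmap.cols = "quant_error",
        color.vec = c("#141B41"),
        line.width = c(0.6),
        palette.color = 6,
        show.points = TRUE)
```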
Now let’s plot all the features for each cell at level one as a heatmap.
hmap <- list()
col_list <- colnames(trainComputers)
hmap <- lapply(1:length(col_list), function(x){
hvtHmap(
map_A,
trainComputers,
child.level = 1,
hmap.cols = col_list[[x]],
line.width = c(0.2),
color.vec = c("#141B41"),
palette.color = 6,
centroid.size = 1.0,
show.points = T,
quant.error.hmap = 0.46,
n_cells.hmap = 100
)
})
grid.arrange(hmap[[1]], nrow = 1, ncol = 1)
grid.arrange(hmap[[2]], nrow = 1, ncol = 1)
grid.arrange(hmap[[3]], nrow = 1, ncol = 1)
grid.arrange(hmap[[4]], nrow = 1, ncol = 1)
grid.arrange(hmap[[5]], nrow = 1, ncol = 1)
grid.arrange(hmap[[6]], nrow = 1, ncol = 1)
4 Map B : Identify the novelties and construct a map on the dataset with novelty
The removeNovelty function removes the
identified novelty cell(s) from the dataset and stores those records
separately.
It takes input as the cell number (Segment.Child) of the manually
identified novelty cell(s) from the above table and the compressed HVT
map (map A). It returns a list of two items:
dataset with novelty records, and a subset of the
dataset without the novelty records.
identified_Novelty_cells <- c(22, 32, 33)
output_list <- removeNovelty(identified_Novelty_cells, map_A)
[1] "The following cell(s) have been removed as outliers from the dataset: 22 32 33"
data_with_novelty <- output_list[[1]]
dataset_without_novelty <- output_list[[2]]
Note - In the dataset with novelty, the total number of cells equals the total number of novelty cell(s) removed from HVT map A. Say, as in the above case, three cells (the 22nd, 32nd, and 33rd) are identified as novelties and removed. Then the 22nd cell becomes the first cell in HVT map B, the 32nd the second cell, and the 33rd the third cell.
Let’s have a look at the data with novelties.
colnames(data_with_novelty) <- c("Cell.ID","Segment.Child","price","speed","hd","ram","screen","ads")
DT::datatable(data_with_novelty, rownames = FALSE,
class = 'cell-border stripe',
options = list(scrollX = TRUE))
4.1 Voronoi tessellation to highlight novelty cell in the map
The plotCells function is used to plot
the Voronoi tessellation using the compressed HVT map (map A) and
highlights the identified novelty cell(s) in red on the map.
plotCells(identified_Novelty_cells, map_A)
Figure 3: The Voronoi tessellation constructed using the compressed HVT map (map A) with the novelty cell(s) highlighted in red
We pass the dataframe with novelty records to the HVT function along with the other model parameters mentioned below to generate map B (layer 2).
Model Parameters
- Number of Cells at each Level = 3
- Maximum Depth = 1
- Quantization Error Threshold = 0.46
- Error Metric = Max
- Distance Metric = Manhattan
dataset_with_novelty <- data_with_novelty[,-1:-2]
map_B <- list()
map_B <- muHVT::HVT(dataset_with_novelty,
n_cells = 3,
depth = 1,
quant.err = 0.46,
projection.scale = 10,
normalize = F,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans",
diagnose = F)
The datatable displayed below is the summary from map B (layer 2).
DT::datatable(map_B[[3]]$summary, rownames = FALSE,
class = 'cell-border stripe',
options = list(scrollX = TRUE))
5 Map C : Construct a map on the dataset without novelty
Construct another hierarchical Voronoi tessellation map (map C) on the dataset without novelty by setting the model parameters mentioned below.
Model Parameters
- Number of Cells at each Level = 100
- Maximum Depth = 2
- Quantization Error Threshold = 0.46
- Error Metric = Max
- Distance Metric = Manhattan
map_C <- list()
map_C <- muHVT::HVT(dataset_without_novelty,
n_cells = 100,
depth = 2,
quant.err = 0.46,
projection.scale = 10,
normalize = F,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans",
diagnose = F)
The datatable displayed below is the summary from map C (layer 2).
DT::datatable(map_C[[3]]$summary, rownames = FALSE,
class = 'cell-border stripe',
options = list(scrollX = TRUE))
Now let’s check the compression summary for HVT (map C). The table below shows, for each level, the number of cells, the number of cells with quantization error below the threshold, and the percentage of cells with quantization error below the threshold.
mapC_compression_summary <- map_C[[3]]$compression_summary %>% dplyr::mutate_if(is.numeric, ~round(., 4))
DT::datatable(mapC_compression_summary, class = 'cell-border stripe', options = list(scrollX = TRUE))
As can be seen from the table above, 85% of the cells have hit the quantization error threshold in level 1 and 100% of the cells have hit the quantization error threshold in level 2.
Now let’s plot all the features for each cell at level two as a heatmap.
hmap <- list()
col_list <- colnames(trainComputers)
hmap <- lapply(1:length(col_list), function(x){
hvtHmap(
map_C,
dataset_without_novelty,
child.level = 2,
hmap.cols = col_list[[x]],
line.width = c(0.6,0.4),
color.vec = c("#141B41","#0582CA"),
palette.color = 6,
centroid.size = 1.0,
show.points = T,
quant.error.hmap = 0.46,
n_cells.hmap = 100
) #%>% ggplotly() # for plotly
})
grid.arrange(hmap[[1]], nrow = 1, ncol = 1)
grid.arrange(hmap[[2]], nrow = 1, ncol = 1)
grid.arrange(hmap[[3]], nrow = 1, ncol = 1)
grid.arrange(hmap[[4]], nrow = 1, ncol = 1)
grid.arrange(hmap[[5]], nrow = 1, ncol = 1)
grid.arrange(hmap[[6]], nrow = 1, ncol = 1)
6 Prediction on Test Data
Now that we have built the model, let us use our test dataset to predict which layer and which cell each point belongs to.
The predictLayerHVT function is used to score the test data using the predictive set of maps. This function takes as input a test dataset and a set of maps (map A, map B, map C).
Let’s have a look at our scaled test dataset.
# Quick peek
testComputers <- scale(testComputers, center = scale_attr$`scaled:center`, scale = scale_attr$`scaled:scale`) %>% as.data.frame()
DT::datatable(head(testComputers), rownames = FALSE,
class = 'cell-border stripe',
options = list(scrollX = TRUE))
Now let us check the structure of the test data and analyse its summary.
str(testComputers)
#> 'data.frame': 1253 obs. of 6 variables:
#> $ price : num -1.228 1.383 -0.802 0.23 0.308 ...
#> $ speed : num -0.783 0.092 0.092 2.667 0.916 ...
#> $ hd : num -0.676 3.063 -0.676 -0.41 1.731 ...
#> $ ram : num -0.718 3.185 -0.718 -0.718 1.624 ...
#> $ screen: num 0.549 0.549 -0.615 -0.615 0.549 ...
#> $ ads : num -0.84 -0.84 -0.84 -0.84 -0.84 ...
summary(testComputers)
#> price speed hd ram
#> Min. :-2.2216 Min. :-0.7834 Min. :-1.0995 Min. :-1.10781
#> 1st Qu.:-1.0368 1st Qu.: 0.0920 1st Qu.: 0.3420 1st Qu.: 0.06296
#> Median :-0.6167 Median : 0.9159 Median : 0.8986 Median : 0.06296
#> Mean :-0.4276 Mean : 0.9755 Mean : 1.4372 Mean : 0.60148
#> 3rd Qu.: 0.1228 3rd Qu.: 1.3793 3rd Qu.: 2.3497 3rd Qu.: 1.62400
#> Max. : 2.8957 Max. : 2.6667 Max. : 8.2965 Max. : 4.74606
#> screen ads
#> Min. :-0.6150 Min. :-3.2654
#> 1st Qu.:-0.6150 1st Qu.:-1.8296
#> Median : 0.5490 Median :-1.4627
#> Mean : 0.4672 Mean :-1.7811
#> 3rd Qu.: 0.5490 3rd Qu.:-1.2872
#> Max. : 2.8768 Max. : 0.7708
Now, let us understand the
predictLayerHVT function -
predictLayerHVT(data,
map_A,
map_B,
map_C,
mad.threshold = 0.2,
normalize = T,
distance_metric="L1_Norm",
error_metric="max",
child.level = 1,
line.width = c(0.6, 0.4, 0.2),
color.vec = c("#141B41", "#6369D1", "#D8D2E1"),
yVar= NULL,
...)
Each of the parameters of the predictLayerHVT function has been explained below:
- data - A dataframe containing the test dataset. The dataframe should have at least one variable used for training. The variables from this dataset can also be overlaid as a heatmap
- map_A - A list of hvt.results.model obtained from the HVT function while performing hierarchical vector quantization on the training data
- map_B - A list of hvt.results.model obtained from the HVT function while performing hierarchical vector quantization on the training data with novelty
- map_C - A list of hvt.results.model obtained from the HVT function while performing hierarchical vector quantization on the training data without novelty
- child.level - A number indicating the layer for which the heat map is to be plotted (only used if hmap.cols is not NULL)
- mad.threshold - A numeric value indicating the permissible mean absolute deviation
- normalize - A logical value indicating if the columns in your dataset should be normalized. Default value is TRUE
- distance_metric - The distance metric can be "Euclidean" or "Manhattan". Euclidean is selected by default
- error_metric - The error metric can be "mean" or "max". mean is selected by default
- yVar - Name of the dependent variable(s)
- ... - color.vec and line.width can be passed from here
The function predicts based on the HVT maps (map A, map B and map C) constructed using the HVT function. For each test record, the function assigns the record to Layer1 or Layer2. Layer1 contains the cell ids from map A, and Layer2 contains the cell ids from map B (novelty map) and map C (map without novelty).
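Conceptually, the routing of a single test record can be sketched as follows. This is an illustration of the decision just described, not the package implementation; the centroid matrices and the novelty cell numbers are assumptions taken from earlier sections:

```r
# Illustrative routing of one test record x (a numeric vector)
route_record <- function(x, centroids_A, centroids_B, centroids_C,
                         novelty_cells = c(22, 32, 33)) {
  # L1 distance from x to every centroid in a matrix
  l1 <- function(cents) apply(cents, 1, function(cc) sum(abs(x - cc)))
  cell_A <- which.min(l1(centroids_A))        # nearest map A cell (Layer1)
  if (cell_A %in% novelty_cells) {
    paste0("B", which.min(l1(centroids_B)))   # novelty: score against map B
  } else {
    paste0("C", which.min(l1(centroids_C)))   # normal: score against map C
  }
}
```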
Note: The prediction algorithm will not work if any of the variables used to perform quantization are missing; no features should be removed from the test dataset.
Let’s see which cell and layer each point belongs to.
validation_data <- testComputers
new_predict <- predictLayerHVT(
data=validation_data,
map_A,
map_B,
map_C,
normalize = F
)
DT::datatable(new_predict, rownames = FALSE,
class = 'cell-border stripe',
options = list(scrollX = TRUE))
Note: From the above table, we can see that the 954th observation from the test data, having Cell.ID A99, has been identified as a novelty and is mapped to the 1st (B1) novelty cell in the Layer2.Cell.ID column. Similarly, the 390th, 413th, 421st, 432nd, 530th, … test records, having Cell.ID A94, are correctly identified as novelties and mapped to the 3rd (B3) novelty cell in the Layer2.Cell.ID column.
7 Executive Summary
In this notebook, we have considered the computers dataset for creating a predictive set of maps to monitor entities over time using predictLayerHVT().
We construct a compressed HVT map (map A) using HVT() on the training dataset by setting n_cells to 100 and quant.err to 0.46.
Based on the output of the above step, we manually identify the novelty cell(s) from the plotted map A. For this dataset, we identify the 22nd, 32nd and 33rd cells as the novelty cells.
We pass the identified novelty cell(s) as a parameter to removeNovelty() along with HVT map A. The function removes the novelty cell(s) from the dataset and stores those records separately. It also returns the dataset without the novelty record(s).
The plotCells() function constructs hierarchical Voronoi tessellations and highlights the identified novelty cell(s) in red.
The dataset with novelty is then passed to HVT() to construct another HVT map (map B), this time setting n_cells = 3 and depth = 1 when constructing the map.
The dataset without novelties is then passed to HVT() to construct another HVT map (map C), this time setting n_cells = 100 and depth = 2 when constructing the map.
Finally, the set of maps (map A, map B, and map C) is passed to predictLayerHVT() along with the test dataset to predict which map and which cell each test record is assigned to.
The output of predictLayerHVT is a dataset with two columns, Layer1.Cell.ID and Layer2.Cell.ID. Layer1.Cell.ID contains cell ids from map A in the form A1, A2, A3, …, and Layer2.Cell.ID contains cell ids from map B as B1, B2, … depending on the identified novelties, and from map C as C1, C2, C3, …
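Since the Layer2 cell ids carry their source map as a prefix, the test records routed to a novelty cell can be filtered out with one line (column name assumed from the description above):

```r
# Rows whose Layer2 assignment came from map B, i.e. identified novelties
novelty_hits <- new_predict[grepl("^B", new_predict$Layer2.Cell.ID), ]
```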
From the output of predictLayerHVT, we can see that the 954th observation from the test data, having Cell.ID A99, has been identified as a novelty and is mapped to the 1st (B1) novelty cell in the Layer2.Cell.ID column. Similarly, the 390th, 413th, 421st, 432nd, 530th, … test records, having Cell.ID A94, are correctly identified as novelties and mapped to the 3rd (B3) novelty cell in the Layer2.Cell.ID column.
8 References
Topology Preserving Maps : https://link.springer.com/chapter/10.1007/1-84628-118-0_7
Vector Quantization : https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-450-principles-of-digital-communications-i-fall-2006/lecture-notes/book_3.pdf
Sammon’s Projection : http://en.wikipedia.org/wiki/Sammon_mapping
Voronoi Tessellations : http://en.wikipedia.org/wiki/Centroidal_Voronoi_tessellation